Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gh 5121 fedx bind left join #5122

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

aschwarte10
Copy link
Contributor

@aschwarte10 aschwarte10 commented Sep 6, 2024

Note: The PR is still WIP (some further changes, and especially unit test coverage will follow) however it is ready for early review. The goal is to have this available in RDF4J 5.1

GitHub issue resolved: #5121

Briefly describe the changes proposed in this PR:

The PR consists of multiple commits which add an implementation for bind left joins. It is recommended to follow the change commit by commit (as the first ones are preparational refactoring)

Commit 1: GH-5121: refactor the bind join logic into a reusable base class

Refactor the existing logic for executing bind joins into a reusable
base class.

This change mostly moves the implementation logic from the existing
ControlledWorkerBindJoin class to a new intermediate implementation
(with the goal to make it reusable in a second step for left joins).

Note that the bind join implementation no longer uses the
ControlledWorkerJoin as base class, i.e. the decision of which join
implementation to use is moved to the strategy.

Commit 2: GH-5121: prepare execution of left joins in the federation strategy

Prepare to execute a specific implementation of a left join
implementation through the federation strategy.

Commit 3: GH-5121: implementation of left bind join operator

This change provides the implementation and activation for the left bind
join operator.

The algorithm is as follows:

  • execute left bind join using regular bound join query
  • process result iteration similar to BoundJoinVALUESConversionIteration
  • remember seen set of bindings (using index) and add original bindings
    to those, i.e. put to result return all non-seen bindings directly from
    the input

Note that the terminology in literature has changed to "bind joins".
Hence, for new classes and methods I try to follow that.

Change is covered with a unit tests

TODOs

  • further unit test coverage
  • validate check left bind join pattern
  • configurability to turn bind left joins off
  • ...

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

Refactor the existing logic for executing bind joins into a reusable
base class.

This change mostly moves the implementation logic from the existing
ControlledWorkerBindJoin class to a new intermediate implementation
(with the goal to make it reusable in a second step for left joins).

Note that the bind join implementation no longer uses the
ControlledWorkerJoin as base class, i.e. the decision of which join
implementation to use is moved to the strategy.
Prepare to execute a specific implementation of a left join
implementation through the federation strategy.
This change provides the implementation and activation for the left bind
join operator.

The algorithm is as follows:

- execute left bind join using regular bound join query
- process result iteration similar to BoundJoinVALUESConversionIteration
- remember seen set of bindings (using index) and add original bindings
to those, i.e. put to result return all non-seen bindings directly from
the input

Note that the terminology in literature has changed to "bind joins".
Hence, for new classes and methods I try to follow that.

Change is covered with a unit tests
@aschwarte10 aschwarte10 added the 📦 fedx fedx: optimized federated query support label Sep 6, 2024
return configuredBindJoinSize;
}

protected interface TaskCreator {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, moving this interface up to the base class seems to cause a class incompatbility:

Sep 06, 2024 12:24:10 PM japicmp.output.incompatible.IncompatibleErrorOutput warn
WARNING: Incompatibility detected: Requires semantic version level MAJOR: JApiMethod [oldMethod=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBoundJoin.determineTaskCreator(org.eclipse.rdf4j.query.algebra.TupleExpr,org.eclipse.rdf4j.query.BindingSet), newMethod=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBoundJoin.determineTaskCreator(org.eclipse.rdf4j.query.algebra.TupleExpr,org.eclipse.rdf4j.query.BindingSet), returnType=JApiReturnType [oldReturnTypeOptional=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBoundJoin$TaskCreator, newReturnTypeOptional=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBindJoinBase$TaskCreator, changeStatus=MODIFIED], getCompatibilityChanges()=[METHOD_RETURN_TYPE_CHANGED]]
Sep 06, 2024 12:24:10 PM japicmp.output.incompatible.IncompatibleErrorOutput warn
WARNING: Incompatibility detected: Requires semantic version level MAJOR: JApiMethod [oldMethod=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBoundJoin.setSubmitFirstResultImmediately(boolean), newMethod=n.a., returnType=JApiReturnType [oldReturnTypeOptional=void, newReturnTypeOptional=value absent, changeStatus=REMOVED], getCompatibilityChanges()=[METHOD_REMOVED]]
Sep 06, 2024 12:24:10 PM japicmp.output.incompatible.IncompatibleErrorOutput warn
WARNING: Incompatibility detected: Requires semantic version level MAJOR: JApiSuperclass [jApiClass=JApiClass [fullyQualifiedName=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBoundJoin, changeStatus=MODIFIED, compatibilityChanges=[METHOD_DEFAULT_ADDED_IN_IMPLEMENTED_INTERFACE]], oldSuperclass=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerJoin, newSuperclass=org.eclipse.rdf4j.federated.evaluation.join.ControlledWorkerBindJoinBase, changeStatus=MODIFIED, compatibilityChanges=[SUPERCLASS_REMOVED]]

I will try to find a solution that remeains binary compatible

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might create the interface of the same name in the parent class and extend this one from that? This should preserve compatibility.

Comment on lines +26 to +39
/**
* A {@link LookAheadIteration} for processing bind left join results.
*
* Algorithm:
*
* <ul>
* <li>execute left bind join using regular bound join query</li>
* <li>process result iteration similar to {@link BoundJoinVALUESConversionIteration}</li>
* <li>remember seen set of bindings (using index) and add original bindings to those, i.e. put to result return all
* non-seen bindings directly from the input</li>
*
*
* @author Andreas Schwarte
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that this is used for OPTIONAL clauses. Is it used for anything else?

protected BindingSet convert(BindingSet bIn, int bIndex) throws QueryEvaluationException {
QueryBindingSet res = new QueryBindingSet();
Iterator<Binding> bIter = bIn.iterator();
while (bIter.hasNext()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a for-each look?

for (Binding bs : bIn){
  ...
}

Comment on lines +60 to +61
int bIndex = Integer.parseInt(
bIn.getBinding(BoundJoinVALUESConversionIteration.INDEX_BINDING_NAME).getValue().stringValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the binding Value object ever a literal? So it would be possible to check if it's instanceof IntegerLiteral or something like that?


protected final ControlledWorkerScheduler<BindingSet> scheduler;

protected final Phaser phaser = new Phaser(1);
Copy link
Contributor

@hmottestad hmottestad Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen in other code that we call phaser.forceTermination() in the handleClose method. Would that be wise here too?

protected final CloseableIteration<BindingSet> iter;
protected final List<BindingSet> bindings;

protected Set<Integer> seenBindingIndexes = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the IntHashSet be appropriate here?

@aschwarte10
Copy link
Contributor Author

@hmottestad thanks for the feedback. I will get back to this PR soon (we are currently in the release finalization phase of our product and this requires most of my attention).

Quick question: is there a timeline for the 5.1 release? I would like to see this change included (if I get it ready) such that we can make use of it in our next release. Ideally the 5.1 release would be available in the second half of november. What are the current project plans?

@hmottestad
Copy link
Contributor

We don't have any plans at the moment.

I'm currently working on the last SHACL paths that aren't supported. I also have a PR for configuring the Apache HttpClient timeouts. And there is a request for adding support for configuring the document loader in the new JSON-LD 1.1 parser.

Would really want the timeouts included, and if I start on the JSON-LD document loader then it shouldn't take more than a few weeks to get it merged.

So we can say mid november for RDF4J 5.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📦 fedx fedx: optimized federated query support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants